ABSTRACT

Managers of systems of shared resources typically have many
separate goals. Examples are efficient utilization of the resources
among the system's users and ensuring that no user’s satisfaction
in the system falls below a preset minimal level. Since such
goals will usually conflict with one another, either implicitly 
or explicitly the manager must determine the relative impor- 
tance of the goals, encapsulating that into an overall utility 
function rating the possible behaviors of the entire system. 
Here we demonstrate a distributed, robust, and adaptive 
way to optimize that overall function. Our approach is to 
interpose adaptive agents between each user and the sys- 
tem, where each such agent is working to maximize its own 
private utility function. In turn, each such agent’s func- 
tion should be both relatively easy for the agent to learn to 
optimize, and “aligned” with the overall utility function of 
the system manager — an overall function that is based on 
but in general different from the satisfaction functions of the 
individual users. To ensure this we enhance the Collective 
INtelligence (COIN) framework to incorporate user satisfac- 
tion functions in the overall utility function of the system 
manager and accordingly in the associated private utility 
functions assigned to the users’ agents. We present exper- 
imental evaluations of different COIN-based private utility 
functions and demonstrate that those COIN-based functions 
outperform some natural alternatives. 

1. INTRODUCTION 

One of the key problems confronting the designers of large- 
scale, distributed agent applications is control of access to 
shared resources. In order for the system to function well, 
access to these shared resources must be coordinated to elim- 
inate potential problems like deadlock, starvation, livelock 
and other forms of resource contention. Without proper 
coordination, both individual user satisfaction and overall 
system performance can be significantly impaired. 

As hand-coded, static procedures are likely to be brittle in
complex environments, we are interested in flexible, adap- 
tive, scalable approaches to the resource sharing problem [2]. 
One approach is to use distributed machine learning-based 
agents each of which works exclusively to increase the sat- 
isfaction of its associated user. While this is appealing, in- 
dividual learners optimizing the “satisfaction” utility func- 
tions of their users typically will not coordinate their activi- 
ties and may even work at cross-purposes, thereby degrading 
the performance of the entire system. Well-known cases of 
this problem of global shared resources include the Tragedy 
of the Commons [4]. Such problems highlight the need for 
careful design of the utility functions that each of the learning
agents works to optimize. In general, this means having each
agent use a utility function that differs from the satisfaction 
function of its associated user [11]. In addition to ensuring 
that the private utility functions of the agents do not induce 
them to work at cross-purposes though, it is also necessary 
that those functions can be readily optimized in an adaptive 
manner, despite high noise levels in the environment. 

The field of mechanism design, originating in game the- 
ory and economics, appears to address this problem. How- 
ever it suffers from restrictions that diminish its suitability 
for problems — like the ones considered here — involving 
bounded-rational, non-human, noisy agents, where the vari- 
ables controlled by the agents and even the number of agents 
can be varied [15]. In contrast, the Collective INtelligence 
(COIN) framework is explicitly formulated to address such 
problems without such restrictions [16, 11, 14]. In particular,
its mathematics derives private utility functions for the
individual learning agents that are learnable, in addition to
inducing the agents not to work at cross-purposes.
Such functions are designed both so that the agents can 
perform well at maximizing them, and so that as they do 
this the agents also “unintentionally” improve the provided 
“world utility” function rating behavior of the entire system. 

In the past, COIN techniques have been used for world util- 
ity functions that are fully exogenous, in that they do not 
reflect any satisfaction functions of a set of users of the sys- 
tem nor how best to accommodate those satisfaction func- 
tions. While achieving far superior performance in the do- 
mains tested to date than do conventional approaches like 
greedy agents and team games [10, 16, 13], these tests do not 
assess how well COIN-based systems perform in situations
like shared resource problems, in which the world utility is
not fully exogenous. In this paper we start by reviewing 
the mathematics of collective intelligence. We then test the 
techniques recommended by that mathematics on shared re- 
source problems, using a world utility function that reflects 
the satisfaction levels of the individual users of the system, 
and in particular trades off concern that no individual’s sat- 
isfaction be too low with concern that the overall system 
behave efficiently. We demonstrate that in such domains 
COIN techniques also far outperform approaches like team 
games. We also validate some quantitative predictions of 
the COIN mathematics concerning how the world utility is 
affected by various modifications to the COIN techniques. 

2. BACKGROUND ON COLLECTIVE INTELLIGENCE

The “COllective INtelligence” (COIN) framework concerns
the design of large distributed collectives of self-interested
agents where there exists a world utility function measuring
the performance of the entire collective. The emphasis
of the research is on the inverse problem: given a set
of self-interested agents and a world utility function, how
should a designer configure the system, and in particular
set the private utility functions of the agents, so that, when
all agents try to optimize their functions, the entire collective
performs better? In this section, we introduce the
mathematics of designing collectives. Because of space lim- 
itations, the concepts cannot be justified or elaborated in 
any detail; references to other papers with such details are 
provided. 

2.1 Central Equation 

Let C be an arbitrary space whose elements $z$ give the joint
move of all agents in the collective system. We wish to
search for the $z$ that maximizes the provided world utility
function $Q(z)$. In addition to $Q$ we are concerned with private
utility functions $\{g_\eta\}$, one such function for each agent
$\eta$ controlling $z_\eta$. We use the notation $\hat{\eta}$ to refer to all agents
other than $\eta$.

We need a way to “calibrate” the utility functions so that
the value assigned to a joint action $z$ reflects the ranking
of $z$ relative to all the other possible joint actions in C. We
denote as intelligence the result of this calibration process.
The intelligence can be considered to be the percentile of
states that would have resulted in having a worse utility
(if 99% of the available actions lead to a worse utility, the
agent made a smart decision). In the COIN framework, the
“intelligence for $\eta$ at $z$ with respect to $U$” is defined by:

$$N_{\eta,U}(z) \equiv \int d\mu_{z_{\hat{\eta}}}(z')\, \Theta[U(z) - U(z')] , \qquad (1)$$

where $\Theta$ is the Heaviside function^1, and where the subscript
on the (normalized) measure $d\mu$ indicates it is restricted to
$z'$ sharing the same non-$\eta$ components as $z$. There is no
particular constraint on the measure $\mu$ other than reflecting
the type of system (whether C is countable or not, if not,
what coordinate system is being used...).

^1 The Heaviside function is defined to equal 1 when its argument
is greater than or equal to 0, and to equal 0 otherwise.
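
For a finite move set, the integral in the definition of intelligence reduces to a count. The following is a minimal Python sketch under stated assumptions: the measure is taken to be uniform over the agent's moves, and `toy_utility`, the move range, and the function names are hypothetical illustrations rather than anything used in the experiments.

```python
from typing import Callable, Sequence, Tuple

JointMove = Tuple[int, ...]


def intelligence(utility: Callable[[JointMove], float],
                 z: JointMove,
                 agent: int,
                 moves: Sequence[int]) -> float:
    """Fraction of the agent's candidate moves that do no better than its actual move.

    The other agents' components of z are held fixed, and the measure over the
    agent's own moves is assumed uniform (an illustrative choice).
    """
    actual = utility(z)
    no_better = 0
    for move in moves:
        z_alt = z[:agent] + (move,) + z[agent + 1:]
        # Heaviside step: contributes 1 when U(z) - U(z') >= 0.
        if actual - utility(z_alt) >= 0:
            no_better += 1
    return no_better / len(moves)


def toy_utility(z: JointMove) -> float:
    # Hypothetical utility: agent 0 prefers move 2, agent 1 prefers a small move.
    return -(z[0] - 2) ** 2 - z[1]


moves = range(5)
print(intelligence(toy_utility, (2, 3), 0, moves))  # agent 0 plays its best move
print(intelligence(toy_utility, (1, 3), 0, moves))  # agent 0 plays a near-best move
print(intelligence(toy_utility, (2, 3), 1, moves))  # agent 1 plays a poor move
```

An intelligence of 1 for an agent means that no alternative move would have given it a strictly higher utility, given what the other agents did.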


As system designer, our uncertainty concerning the state of
the system is reflected in a probability distribution P over C.
Our ability to control the system consists of setting the value
of some characteristic of the collective, which is denoted as
its design coordinate. For instance, the design coordinate
can be setting the private utility functions of the agents.
For a value $s$ of the design coordinate, our analysis revolves
around the following central equation for $P(Q \mid s)$, which
follows from Bayes’ theorem:

$$P(Q \mid s) = \int d\vec{N}_Q\, \underbrace{P(Q \mid \vec{N}_Q, s)}_{\text{explore vs. exploit}} \int d\vec{N}_g\, \underbrace{P(\vec{N}_Q \mid \vec{N}_g, s)}_{\text{factoredness}}\, \underbrace{P(\vec{N}_g \mid s)}_{\text{learnability}} , \qquad (2)$$

where $\vec{N}_g$ and $\vec{N}_Q$ are the intelligence vectors of the agents
with respect to the private utility $g_\eta$ of $\eta$ and the world
utility $Q$, respectively.

Note that $N_{\eta,g_\eta}(z) = 1$ means that agent $\eta$ is fully rational
at $z$, in that its move maximizes the value of its utility, given
the moves of the other agents. In other words, a point $z$ where
$N_{\eta,g_\eta}(z) = 1$ for all agents $\eta$ is one that meets the definition
of a game-theory Nash equilibrium. On the other hand, a $z$
at which all components of $\vec{N}_Q$ equal 1 is a maximum of $Q$ along
all coordinates of $z$. So if we can get these two points to be
identical, then if the agents do well enough at maximizing
their private utilities we are assured we will be near an
axis-maximizing point for $Q$.

To formalize this, consider our decomposition of $P(Q \mid s)$.

• learnability: if we can choose the global coordinate $s$
so that the third conditional probability in the integrand
is peaked around vectors $\vec{N}_g$ all of whose components
are close to 1 (that is, agents are able to learn
their tasks), then we have likely induced large (private
utility function) intelligences. Intuitively, this ensures
that the private utility functions have high “signal-to-noise”.

• factoredness: by choosing $s$, if we can also have the
second term be peaked about $\vec{N}_Q$ equal to $\vec{N}_g$ (that is,
the private utility and the world utility are aligned),
then $\vec{N}_Q$ will also be large. It is in the second term
that the requirement that the private utility functions
be “aligned with $Q$” arises. Note that our desired form
for the second term in Equation 2 is assured if we have
chosen private utilities such that $\vec{N}_Q$ equals $\vec{N}_g$ exactly
for all $z$. Such a system is said to be factored.

• Finally, if the first term in the integrand is peaked
about high $Q$ when $\vec{N}_Q$ is large, then our choice of $s$
will likely result in high $Q$, as desired.

On the other hand, unless one is careful, each agent will 
have a hard time discerning the effect of its behavior on 
its private utility when the system is large. Typically, each 
agent has a horrible “signal-to-noise” problem in such a situation.
This will result in a poor term three in the central
equation. For instance, consider a collective provided by
the human economy. A team game in that example would 
mean that every citizen gets the national GDP as her/his 
reward signal, and tries to discern how best to act to maxi- 
mize that reward signal. At the risk of understatement, this 
would provide the individual members of the economy with 
a difficult reinforcement learning task. 

In this paper, we concentrate on the second and third terms, 
and show how to simultaneously set them to have the desired 
form in the next section. Hence, the research problem is to 
choose s so that the system both has good learnabilities for 
its agents, and is factored. 

Mechanism design might, at first glance, appear to provide 
us techniques for solving this version of the inverse problem. 
However while it can be viewed as addressing the second 
term, the issue of learnability is not studied in mechanism 
design. Rather mechanism design is almost exclusively con- 
cerned with collectives that are at (a suitable refinement of) 
an exact Nash equilibrium [6]. That means that every agent 
is assumed to be performing as well as is theoretically possi- 
ble, given the behavior of the rest of the system. In setting 
private utilities and the like on this basis, mechanism design 
ignores completely the issue of how to design the system so 
that each of the agents can achieve a good value of its pri- 
vate utility (given the behavior of the rest of the system). In 
particular it ignores all statistical issues related to how well 
the agents can be expected to perform for various candidate 
private utilities. 

Such issues become crucial as one moves to large systems,
where each agent is implicitly confronted with a very
high-dimensional reinforcement learning task. Because it ignores
this issue, mechanism design scales poorly to
large problems. In contrast, the COIN approach is precisely
designed to address both learnability issues and term 2.

2.2 Wonderful Life Utility and Aristocrat Utility

As an example of the foregoing, any “team game” in which
all private utility functions equal $Q$ is factored [1]. However,
as illustrated above in the human economy example,
team games often have very poor forms for term 3 in Equation 2,
forms which get progressively worse as the size of
the collective grows. This is because for such private utility
functions each agent $\eta$ will usually confront a very poor
“signal-to-noise” ratio in trying to discern how its actions
affect its utility $g_\eta = Q$, since so many other agents’ actions
also affect $Q$ and therefore dilute $\eta$’s effect on its own private
utility function.

We now focus on algorithms based on private utility functions
$\{g_\eta\}$ that optimize the signal/noise ratio reflected in
the third term of the central equation, subject to the requirement
that the system be factored. We will introduce two
utility functions satisfying these criteria, namely the Wonderful
Life Utility (WLU) and the Aristocrat Utility (AU).

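To understand how these algorithms work, say we are given
an arbitrary function $f(z_\eta)$ over agent $\eta$’s moves, two such
moves $z_\eta^1$ and $z_\eta^2$, a utility $U$, a value $s$ of the design
coordinate, and a move by all agents other than $\eta$, $z_{\hat{\eta}}$. Define
the associated learnability by

$$\Lambda_f(U; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2) \equiv \sqrt{\frac{[E(U; z_{\hat{\eta}}, z_\eta^1) - E(U; z_{\hat{\eta}}, z_\eta^2)]^2}{\int dz_\eta\, [f(z_\eta)\, \mathrm{Var}(U; z_{\hat{\eta}}, z_\eta)]}} . \qquad (3)$$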

The expectation values in the numerator are formed by
averaging over the training set of the learning algorithm
used by agent $\eta$, $n_\eta$. Those two averages are evaluated according
to the two distributions $P(U \mid n_\eta) P(n_\eta \mid z_{\hat{\eta}}, z_\eta^1)$ and
$P(U \mid n_\eta) P(n_\eta \mid z_{\hat{\eta}}, z_\eta^2)$, respectively. (That is the meaning
of the semicolon notation.) Similarly the variance being
averaged in the denominator is over $n_\eta$ according to the
distribution $P(U \mid n_\eta) P(n_\eta \mid z_{\hat{\eta}}, z_\eta)$.

The denominator in Equation 3 reflects how sensitive $U(z)$ is
to changing $z_{\hat{\eta}}$. In contrast, the numerator reflects how sensitive
$U(z)$ is to changing $z_\eta$. So the greater the learnability
of a private utility function $g_\eta$, the more $g_\eta(z)$ depends only
on the move of agent $\eta$, i.e., the better the associated
signal-to-noise ratio for $\eta$. Intuitively then, so long as it does not
come at the expense of decreasing the signal, increasing the
signal-to-noise ratio specified in the learnability will make it
easier for $\eta$ to achieve a large value of its intelligence. This
can be established formally: if appropriately scaled, $g'_\eta$ will
result in better expected intelligence for agent $\eta$ than will
$g_\eta$ whenever $\Lambda_f(g'_\eta; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2) > \Lambda_f(g_\eta; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2)$
for all pairs of moves $z_\eta^1, z_\eta^2$ [14].

One can solve for the set of all private utilities that are
factored with respect to a particular world utility. Unfortunately
though, in general a collective cannot both be factored
and have infinite learnability for all of its agents [14].
However, consider difference utilities, of the form

$$U(z) = \beta\, [Q(z) - D(z_{\hat{\eta}})] . \qquad (4)$$

Any difference utility is factored [14]. In addition, for all
pairs $z_\eta^1, z_\eta^2$, under benign approximations, the difference
utility maximizing $\Lambda_f(U; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2)$ is found by choosing

$$D(z_{\hat{\eta}}) = E_f(Q(z) \mid z_{\hat{\eta}}, s) , \qquad (5)$$

up to an overall additive constant, where the expectation
value is over $z_\eta$. We call the resultant difference utility the
Aristocrat utility (AU), loosely reflecting the fact that it
measures the difference between an agent’s actual action and
the average action. If each agent $\eta$ uses an appropriately
rescaled version of the associated AU as its private utility
function, then we have ensured good form for both terms 2
and 3 in Equation 2.
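
To make the Aristocrat utility concrete, here is a minimal Python sketch for a finite move set. It assumes a uniform distribution over the agent's own moves when forming the expectation and an overall scale factor of 1; the world utility `toy_world_utility` (minus the largest number of agents sharing a machine) and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

JointMove = Tuple[int, ...]


def toy_world_utility(z: JointMove) -> float:
    """Hypothetical world utility: minus the largest number of agents on one machine."""
    return -float(max(Counter(z).values()))


def aristocrat_utility(world_utility: Callable[[JointMove], float],
                       z: JointMove,
                       agent: int,
                       moves: Sequence[int],
                       scale: float = 1.0) -> float:
    """World utility of the actual joint move minus its average over the agent's moves.

    The average assumes a uniform distribution over the agent's own moves,
    with every other agent's move held fixed.
    """
    counterfactuals = [
        world_utility(z[:agent] + (move,) + z[agent + 1:]) for move in moves
    ]
    expected = sum(counterfactuals) / len(counterfactuals)
    return scale * (world_utility(z) - expected)


z = (0, 0, 1)        # agents 0 and 1 use machine 0, agent 2 uses machine 1
machines = (0, 1)
print(aristocrat_utility(toy_world_utility, z, 0, machines))  # agent 0
print(aristocrat_utility(toy_world_utility, z, 2, machines))  # agent 2
```

In this toy case agent 0 cannot change the world utility by switching machines, so its AU is zero, whereas agent 2 is doing better than its average move and receives a positive AU.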

Using AU in practice is sometimes difficult, due to the need
to evaluate the expectation value. Fortunately there are
other utility functions that, while being easier to evaluate
than AU, still are both factored and possess superior learnability
to the team game utility, $g_\eta = Q$. One such private
utility function is the Wonderful Life Utility (WLU). The
WLU for agent $\eta$ is parameterized by a pre-fixed clamping
parameter $CL_\eta$ chosen from among $\eta$’s possible moves:

$$WLU_\eta \equiv Q(z) - Q(z_{\hat{\eta}}, CL_\eta) . \qquad (6)$$

WLU is factored no matter what the choice of clamping parameter.
Furthermore, while not matching the high learnability
of AU, WLU usually has far better learnability than
does a team game, and therefore (when appropriately scaled)
results in better expected intelligence [11, 16, 14].
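
The factoredness of WLU can be checked by brute force on a small example. The sketch below is illustrative only: the world utility (minus the sum of squared machine loads), the use of `None` as a "null" clamping move, and the function names are assumptions made for the example.

```python
from collections import Counter
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

JointMove = Tuple[Optional[int], ...]


def toy_world_utility(z: JointMove) -> float:
    """Hypothetical world utility: minus the sum of squared machine loads.

    A move of None means the agent sends nothing to any machine.
    """
    loads = Counter(move for move in z if move is not None)
    return -float(sum(count * count for count in loads.values()))


def wonderful_life_utility(world_utility: Callable[[JointMove], float],
                           z: JointMove,
                           agent: int,
                           clamp: Optional[int] = None) -> float:
    """World utility of z minus the world utility with the agent's move clamped."""
    clamped = z[:agent] + (clamp,) + z[agent + 1:]
    return world_utility(z) - world_utility(clamped)


def is_factored(world_utility: Callable[[JointMove], float],
                n_agents: int,
                moves: Sequence[int]) -> bool:
    """Check that every unilateral change moves WLU and the world utility the same way."""
    for z in product(moves, repeat=n_agents):
        for agent in range(n_agents):
            for move in moves:
                z_new = z[:agent] + (move,) + z[agent + 1:]
                d_world = world_utility(z_new) - world_utility(z)
                d_wlu = (wonderful_life_utility(world_utility, z_new, agent)
                         - wonderful_life_utility(world_utility, z, agent))
                if (d_world > 0) != (d_wlu > 0) or (d_world < 0) != (d_wlu < 0):
                    return False
    return True


print(wonderful_life_utility(toy_world_utility, (0, 0, 1), 0))
print(is_factored(toy_world_utility, n_agents=3, moves=(0, 1)))
```

The check succeeds because the clamped term does not depend on the agent's own move, so a unilateral change alters WLU by exactly the same amount as it alters the world utility.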


3. PROBLEM DEFINITION

Let us consider a set of users and a set of m shared resources/machines.
Different users have different task loads, each task being of
one of T types. Any machine can perform any task type, but
different machines have different processing speeds for each
task type. A given user sends all of its tasks of a particular
type to a specific machine but can send tasks of different
types to different machines. So each user must decide, for
each $j \in \{1..T\}$, to what machine to send all of its tasks of
type $j$.

We consider a batch mode scenario, in which every user
submits all of its tasks, the machines complete their tasks,
and the associated overall performance of the system is ascertained.
Each machine works by grouping all the tasks
sent to it by type, and performs all tasks of one type contiguously
before returning them to the associated users and
then switching to the next type. The order of processing
different task types is fixed for any given machine. Each
type is performed at a speed specific to the machine.

3.1 Satisfaction functions of users

User $i$ has a personal “satisfaction” utility function, $H_i$,
which is an inverse function of the time that the user has
to wait for all of his tasks to finish. The user may be less
willing to wait for certain task types compared to others.
The user can provide feedback or his model of satisfaction
to the system manager.

We indicate the completion time of the tasks of type $j$ submitted
by user $i$ to machine $A_i^j$ by $CT_i^j$.

The functions defined below measure dissatisfaction rather
than satisfaction, in that they are monotonically increasing
functions of the delay. So the goal for agent $i$ is to maximize
$H_i = -V_i$, where $V_i$ is defined as one of the following:

• the maximum delay for user $i$, $V_i = \max_{j \in \{1..T\}} CT_i^j$,

• an importance-weighted combination of the time of completion
of all the tasks of the user, $V_i = \sum_{j=1}^{T} \alpha_j CT_i^j$.

$V = \{V_i\}$ is the set of the dissatisfactions of all users.

3.2 Definition of the World Utility Function Q

The world utility function measures the performance of the
overall system. It is the responsibility of the system designer
to set this function. He might decide to measure performance
by measuring some characteristics of the use of the
resource only (e.g. the load distribution or the idle time).
The system designer might also incorporate user preferences
to measure the performance of the system. In our simulations,
we have focused on that aspect by defining the world
utility function as a function of the dissatisfactions of all
users. This assumes that either the users have provided a
model of their preferences to the system designer or that
the system designer has modeled them. In the former case,
users may be allowed to update their model.

We have experimented with the following world utilities:

• Avoid hurting any user: minimize $Q(V) = \max_i V_i$,

• Minimize a linear combination of the dissatisfactions of individual
users, where the weight associated with a user represents
the importance of that particular user: $Q(V) = \sum_i w_i V_i$.

• To have high average satisfaction without any user’s satisfaction
being low, minimize $Q(V) = \beta\, \sigma_V + \sum_i w_i V_i$, $\sigma_V$
being the variance in the dissatisfactions.
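
The batch-mode model and the resulting quantities can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the simulator used in the experiments: a task load is treated as an amount of work, processing time is load divided by speed, switching between types costs nothing, and the spread term uses the population standard deviation. All function names and example numbers are hypothetical.

```python
from statistics import pstdev
from typing import List, Sequence


def completion_times(assignment: Sequence[Sequence[int]],
                     loads: Sequence[Sequence[float]],
                     speeds: Sequence[Sequence[float]],
                     order: Sequence[Sequence[int]]) -> List[List[float]]:
    """Return ct[i][j], the time at which user i's type-j tasks are returned.

    assignment[i][j]: machine chosen for user i's tasks of type j
    loads[i][j]:      amount of type-j work submitted by user i
    speeds[k][j]:     processing speed of machine k on type j
    order[k]:         fixed order in which machine k processes the types
    """
    n_users, n_types = len(loads), len(loads[0])
    ct = [[0.0] * n_types for _ in range(n_users)]
    for machine, type_order in enumerate(order):
        clock = 0.0
        for task_type in type_order:
            batch = [i for i in range(n_users) if assignment[i][task_type] == machine]
            if not batch:
                continue  # assumption: an empty batch takes no time
            clock += sum(loads[i][task_type] for i in batch) / speeds[machine][task_type]
            for i in batch:
                ct[i][task_type] = clock  # the whole batch is returned together
    return ct


def dissatisfaction_max(ct: Sequence[Sequence[float]]) -> List[float]:
    return [max(row) for row in ct]


def dissatisfaction_weighted(ct: Sequence[Sequence[float]],
                             alpha: Sequence[float]) -> List[float]:
    return [sum(a * t for a, t in zip(alpha, row)) for row in ct]


def world_max(v: Sequence[float]) -> float:
    return max(v)


def world_linear(v: Sequence[float], w: Sequence[float]) -> float:
    return sum(wi * vi for wi, vi in zip(w, v))


def world_spread(v: Sequence[float], w: Sequence[float], beta: float = 1.0) -> float:
    # Assumption: the spread of dissatisfactions is measured by the standard deviation.
    return beta * pstdev(v) + world_linear(v, w)


# Two users, two task types, two machines (hypothetical numbers).
loads = [[4.0, 2.0], [2.0, 6.0]]
speeds = [[2.0, 1.0], [1.0, 3.0]]
order = [[0, 1], [1, 0]]
assignment = [[0, 0], [0, 1]]

ct = completion_times(assignment, loads, speeds, order)
v = dissatisfaction_weighted(ct, alpha=[1.0, 0.5])
print(ct, v, world_max(dissatisfaction_max(ct)), world_linear(v, [1.0, 1.0]))
```

All three world utilities are quantities to be minimized, so a learning agent that maximizes a utility would work with their negatives.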


3.3 Agent learning algorithms 

Once the measure of dissatisfaction of each agent and the 
performance metric of the overall system have been defined, 
the remaining question is how to improve the performance. 
In the experiments, we want to answer the question: what
is the function that each self-interested agent has to improve?
We refer to this function as the private utility
for the agent. In other words, given the agents’ satisfaction
model and the performance metric of the system, we want
to be able to tell the agents that the best way to improve the
performance of the system (which includes their own preferences)
is to optimize a certain private utility. If they decide
to optimize other criteria, the performance of the system
will not be as good as if they follow the recommendation.

We compared performance ensuing from four private utility
functions for the agents, the last two being those recommended
by the COIN framework:

• a team game: each agent’s utility equals $Q$, i.e., by choosing
its action, each agent is trying to optimize the performance
of the overall system (the agent only focuses on the result
of the team, it does not consider its own performance);

• a greedy game: each agent $i$’s private utility is $H_i$ (the
agent focuses only on its performance);

• AU: to compute AU for an agent $i$, we need to evaluate
$Q(i, j)$, where $Q(i, j)$ denotes the world utility when
agent $i$ sends its job to machine $j$ while all other agents send
their jobs to the same machines they actually used. Thus,
we need to re-evaluate the completion times of all machines
$m$ (when the other agents maintain their decisions while agent
$i$ sends its load to $m$). See [14] for a discussion of efficient
evaluation of AU;

• WLU: to compute WLU for agent $i$, we chose to clamp to
“null” the action of $i$. Hence, $WLU_i = Q(z) - Q(z_{\hat{\imath}}, \text{null})$, the
second term being the world utility with $i$’s load removed. Thus we
only need to evaluate what the completion time of the machine
chosen by $i$ would be if $i$ did not send any load to that
machine. See [14] for a discussion of efficient evaluation of
AU.

We had each agent use an extremely simple reinforcement
learning algorithm in our experiments, so that performance
more accurately reflects the quality of the agents’ utility
functions rather than the sophistication of the learning scheme.
In this paper, each agent is “blind”, knowing nothing about
the environment in which it operates, and observing only
the utility value it gets after an iteration of the batch process.
The agents each used a modified version of a Boltzmann
learning algorithm to map the set of such utility values
to a probability distribution over the (finite) set of possible
moves, a distribution that is then sampled to select
the agent’s move for the next iteration. Each agent tries
to learn the estimated reward $R_i$ corresponding to the action
$i$. An agent will choose the action $i$ with probability

$$p_i = \frac{e^{\beta R_i}}{\sum_{\text{all actions } j} e^{\beta R_j}} .$$

The parameter $\beta$ is the inverse temperature. During round $r$,
if the action $i$ is chosen, the update of the estimated reward
for action $i$ is

$$R_i = \frac{\sum_{j \in C_i} R_j\, e^{-\alpha (r - j)}}{\sum_{j \in C_i} e^{-\alpha (r - j)}} ,$$

where $C_i$ denotes the set of rounds
where action $i$ was chosen, and $R_j$ denotes the reward received
at the end of the round $j$. The data aging parameter
$\alpha$ models the importance of the choices made in the previous
rounds. In our modification, rather than have the expected
utility value for each possible move of an agent be a uniform
average of the utilities that arose in the past when it made
that move, we exponentially discount moves made further
into the past, to reflect the non-stationarity of the system.
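
The learner described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the exponential weight `exp(-aging * age)` is one assumed form of the discount, untried actions are given a neutral estimate of 0, and the class and method names are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple


class BoltzmannAgent:
    """Boltzmann action selection over exponentially aged reward estimates."""

    def __init__(self, n_actions: int, temperature: float = 10.0,
                 aging: float = 0.1, seed: Optional[int] = None) -> None:
        # history[a] holds (round, reward) pairs for every round in which a was chosen.
        self.history: List[List[Tuple[int, float]]] = [[] for _ in range(n_actions)]
        self.temperature = temperature
        self.aging = aging
        self.rng = random.Random(seed)

    def estimates(self, current_round: int) -> List[float]:
        values = []
        for records in self.history:
            if not records:
                values.append(0.0)  # assumption: neutral estimate for untried actions
                continue
            # Assumed discount: older rewards are weighted by exp(-aging * age).
            weights = [math.exp(-self.aging * (current_round - r)) for r, _ in records]
            weighted = sum(w * reward for w, (_, reward) in zip(weights, records))
            values.append(weighted / sum(weights))
        return values

    def probabilities(self, current_round: int) -> List[float]:
        beta = 1.0 / self.temperature  # inverse temperature
        est = self.estimates(current_round)
        top = max(est)  # subtracting the maximum avoids overflow in exp
        exps = [math.exp(beta * (e - top)) for e in est]
        total = sum(exps)
        return [x / total for x in exps]

    def choose(self, current_round: int) -> int:
        probs = self.probabilities(current_round)
        return self.rng.choices(range(len(probs)), weights=probs)[0]

    def record(self, round_index: int, action: int, reward: float) -> None:
        self.history[action].append((round_index, reward))

    def cool(self, decay: float) -> None:
        self.temperature *= decay


agent = BoltzmannAgent(n_actions=2, temperature=1.0, seed=0)
agent.record(0, 0, 1.0)   # round 0: action 0 earned reward 1
agent.record(1, 1, 3.0)   # round 1: action 1 earned reward 3
print(agent.probabilities(current_round=2))
print(agent.choose(current_round=2))
```

Storing the full history keeps the sketch close to the description; an incremental update of the weighted sums would give the same estimates with constant memory per action.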

4. EXPERIMENTS 

We experiment with a domain where users are sending different
types of printing jobs to a shared pool of heterogeneous
printers. The different types of printing jobs may represent
different kinds of printing operations (for instance, printing
black & white, printing in color). Though each printer is
able to process any job type, a given printer processes jobs
of different types at different speeds. For each printer, we
randomly choose the speed of processing each task type, and
processing occurs in the order of decreasing speed.

Each user has a load of printing jobs to perform. We as- 
sign one agent to handle all tasks of a given type for a given 
user, and its job is to select a printer to which these tasks 
would be sent. Each run starts with an initial sequence of 
iterations in which the agents make purely random moves to 
generate data for the learning algorithms. We then successively
“turn on” more and more of those algorithms, i.e., at
each successive iteration, a few more agents start to make
their moves based on their learning algorithm and associated
data rather than randomly. We start each agent with a
high Boltzmann temperature, usually 10, and decrease it at
each iteration by multiplying it by a decay factor. We refer
to a temperature schedule as slow when the decay factor is
large (for instance 99%), and refer to the schedule as fast
when the decay factor is small (for instance 70%). We use
an aging parameter of 0.1 in forming expectation values.
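
To give a feel for the difference between slow and fast schedules, the following small Python sketch counts how many iterations a multiplicative schedule needs before the temperature falls below a given level. The target level of 1 and the function name are illustrative assumptions; only the starting temperature of 10 and the decay factors 0.99 and 0.70 come from the text.

```python
def iterations_to_cool(start: float, decay: float, target: float) -> int:
    """Number of multiplications by `decay` needed to bring `start` below `target`."""
    temperature = start
    iterations = 0
    while temperature >= target:
        temperature *= decay
        iterations += 1
    return iterations


print(iterations_to_cool(10.0, 0.99, 1.0))  # slow schedule
print(iterations_to_cool(10.0, 0.70, 1.0))  # fast schedule
```

With these numbers the slow schedule needs 230 iterations to cool by a factor of ten, while the fast schedule needs only 7.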

In the experiments reported here, the dissatisfaction of a
given user is $V_i = \sum_{j=1}^{T} \alpha_j CT_i^j$, where the $\alpha_j$ are randomly
chosen and remain the same throughout the learning process.
We used 10 users, 8 types and 6 machines.

4.1 Results 

We first considered the case where the World Utility Function
$Q$ is computed as the maximum of the dissatisfactions
of the users, i.e., $Q = \max_i V_i$. The goal is to minimize the
maximum dissatisfaction among the users. The decay factor
used is 0.99. In Fig. 1 we provide performance for the
different agent utility functions averaged over 10 runs. The
performances achieved by AU and WLU are similar, and
dominate that of both Team Game and Greedy agent utilities.
We found that at the end of the learning process, the
standard deviation in $Q$ for AU and WLU is small (respectively
2.1 and 1.08) whereas it is large for Greedy and Team
Game (respectively 14.4 and 24). This corroborates the reasoning
that the signal-to-noise ratio is too low when we use Greedy
or Team Game, and therefore the learning agents are not
able to determine the impact of their actions on their utility
functions. In contrast, as discussed above, AU and WLU are
designed to have good signal to noise, which enables them
to reach better (and more consistent) performance levels.

Figure 1: Comparison between different agent utilities
(Greedy, Team Game, AU and WLU) when $Q = \max_i V_i$; the
temperature schedule uses a decay of 99%.

Note that with the greedy approach Q worsens as each agent 
learns how to maximize its utility. This is an example of a 
tragedy of the commons: since the Greedy utility is not 
factored with respect to Q, the associated Nash equilibrium 
— achieved as the agents learn more — has poor Q. 

In Fig. 2, $Q$ is instead the weighted sum of the dissatisfactions
of the users, i.e., $Q = \sum_i w_i V_i$. In Fig. 3 it is the sum of
all the users’ dissatisfactions with the standard deviation of
those dissatisfactions, i.e., $Q = \sigma_V + \sum_i V_i$ (we have used
$\beta = 1$). The decay factor used in these experiments is 0.99.
In all cases, AU and WLU achieve comparable performance,
and outperform team game.

Although the asymptotic performance of the team game is 
not as good as that of AU or WLU, in both of these exper- 
iments team game performs better than AU or WLU in the 
early phase of the learning. In addition the convergence in 
time of Q for AU and WLU is slower in these experiments. 
This is consistent with COIN theory, which says that simply 
changing to a more learnable utility function may actually 
lead to worse performance if the learning temperature is not 
changed simultaneously. Intuitively, to achieve their better 
signal to noise ratios, AU and WLU shrink both the signal 
and the noise, and therefore need both to be rescaled up- 
ward. In the context of Boltzmann learning algorithms, this 
is equivalent to lowering their temperatures. 
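
The equivalence between rescaling a utility and changing the temperature follows directly from the form of the Boltzmann distribution. As a short worked step, assuming action probabilities proportional to $e^{R_i/T}$ with estimated rewards $R_i$ and temperature $T$ (the scale factor $c$ is introduced here only for illustration):

```latex
% Multiplying every estimated reward by c > 0 at temperature T
% gives the same distribution as the original rewards at temperature T / c.
p_i(cR;\, T)
  = \frac{e^{c R_i / T}}{\sum_j e^{c R_j / T}}
  = \frac{e^{R_i / (T/c)}}{\sum_j e^{R_j / (T/c)}}
  = p_i(R;\, T/c), \qquad c > 0 .
```

So rescaling a utility upward by a factor $c > 1$ has the same effect on the agent's move distribution as dividing its temperature by $c$.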

To investigate this issue, in a second set of experiments
we used different decay factors and faster schedules. In
Fig. 4 we present the asymptotic world utility value for
Team Game, AU, and WLU for different decay factors and
starting temperatures. The experiments plotted use one of the
world utilities $Q$ discussed above, and we observed the same
phenomena with the others: the team game never outperforms
AU or WLU. Moreover, if for each time-algorithm pair
we used the schedule that is optimal for that pair, then there is no
time $t$ at which the performance of the team game is better
than the performance of AU/WLU. Fig. 5 illustrates this,
showing that with a fast schedule having a decay factor of
0.65, AU and WLU converge faster than team game.

Figure 2: Comparison between different rewards in
the case where $Q = \sum_i w_i V_i$.

Figure 3: Performance for $Q = \sigma_V + \sum_i V_i$, plotted against
iterations, with temperature decay = 99%.

In addition to these advantages, we found that the performance
with the team game is more sensitive to the change
in temperature schedule, again in accord with COIN theory.
Indeed, although convergence to asymptotia is faster for all
agent utilities with the faster schedule (see Fig. 5), from
Fig. 4 we see that both average asymptotic performance and
its standard deviation are substantially worse for the team game
for the faster schedule. In contrast, AU and WLU do not
suffer as much from the switch to a faster schedule.


Figure 5: Performance plotted against iterations, with
temperature decay = 65%.


We also investigated the scaling properties of these approaches 
as one increases the size of the system and found that AU 
and WLU scale up much better than the team game ap- 
proach, again in accord with COIN theory. Intuitively, the 
larger the system, the worse the signal-to-noise problems 
with utilities like team games, and therefore the greater the 
gain in using a utility like AU or WLU that corrects for it. 
In Fig. 6, we used a fixed number of types, 8, and a fixed 
ratio of 1 machine for 5 users. We increased the number 
of users from 5 to 25. For each data point, we considered 
a few scenarios (different machine configurations, different 
loads distributed to users, etc.) and performed 10 runs for 
each scenario. Each point represents the average over the 
scenarios. The use of AU and WLU yields comparable results,
and the difference in performance between them and
Team Game increases with the number of agents.

Since in general one cannot compute the best possible Q 
value, we cannot make absolute performance claims. Nonethe- 
less, both in terms of performance and scaling, use of WLU 
or AU is clearly preferable to a team game or greedy ap- 
proach, for several different choices for Q. 

Figure 6: Scaling properties for problems of different sizes
(8 types, 1 machine for 5 users).

5. RELATED WORK


Multiagent system researchers have studied balancing of load 
across shared resources with different decision-making pro- 
cedures. These approaches include the following: 

• Studying the chaotic nature of resource loads when each agent
uses a greedy selection procedure [5].

• Effect of limited knowledge on system stability [7, 9]. 

• Market mechanisms for optimizing quality-of-service [17]. 

• Using reinforcement learning to balance loads [8]. 

• Social dilemma problems that arise when individual agents 
try to greedily exploit shared resources [3, 11]. 

• Distinguishing easy vs. difficult resource allocation [12]. 

The typical assumption in most of this work is that each user
has an atomic load to send to one of several equivalent resources.
Our work addresses a more general scenario where
a user has multiple task loads and resources are heterogeneous.
So not only may the decisions of the users conflict,
but decisions for different tasks taken by the same user can
interfere with each other. Also, we explicitly handle the
issue of user satisfaction metrics (something awkward to incorporate
in load-balancing, for example) and variability in
the world utility, and use the COIN procedure to ensure
that the combination of individual user satisfaction metrics
is optimized. The combination can include weights for different
users and also, if necessary, reduce disparate levels of
satisfaction from the outcome by minimizing the standard
deviation of satisfaction. With the COIN procedure, all of
this is done using very simple agents, in a highly parallelizable
fashion, with no modeling of the underlying dynamics
of the system.

6. DISCUSSION 

The COIN framework has been applied to many domains,
including domains requiring sequences of actions. In these
cases, the world utility function was decided first, and then
the local utility function of each agent was derived from it.
This paper differs in two main ways from the COIN applications
studied so far. First, agents have preferences represented
by a dissatisfaction function. In previous work, the
individuals in the system did not have any intrinsic preferences.
Secondly, the world utility is based upon the preferences
of users and a system administrator. The main contribution
of the paper is to show that the users, in order
to achieve satisficing performance of the overall system, had
better prefer to optimize COIN-based private utilities. The
results demonstrate that even in the case where the world
utility function is not exogenous, the previously demonstrated superiority
of COIN techniques holds. In particular, the utilities AU
and WLU outperform locally greedy approaches and, more
importantly, the Team Game approach, in terms of average
performance, variability in performance, and robustness
against changes to the parameters of the agents’ learning algorithm.
Moreover, these improvements grow dramatically
as the system size is increased, an extremely important consideration
in future applications.

We are currently experimenting with aggregating several 
printing decisions under the jurisdiction of one agent. This 
would reduce the number of agents in the system, and pos- 
sibly both speed up the convergence of the performance and 
improve it. In particular we will investigate different kinds
of aggregations, e.g., based on users, based on task types,
crossing both, etc. There is also the interesting possibility 
of learning the best aggregation of decisions into individual 
agents. 